This report explores a dataset containing quality ratings and chemical attributes for approximately 4,900 white wines.
## [1] 4898
## [1] 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Our dataset consists of 13 variables, with almost 4,900 observations.
The distribution of white wine quality looks approximately normal. Why is it normal? I wonder whether white wine quality is determined by chance. I also wonder what this distribution looks like across the other attributes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The smallest fixed acidity value is 3.8 and the largest is 14.2. The plot above shows the main body of the fixed acidity distribution, which is approximately normal. I suspect this variable does not affect wine quality much, because wine quality is approximately normally distributed too. fixed.acidity also has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The smallest volatile acidity value is 0.08 and the largest is 1.1. The plot above shows the main body of the volatile acidity distribution, which is also approximately normal. I suspect this variable does not affect wine quality much, for the same reason as fixed acidity. Volatile acidity also has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The smallest citric.acid value is 0 and the largest is 1.66. The plot above shows the main body of the citric acid distribution, which is also approximately normal. However, the distribution has one large spike near 0.5. I wonder whether the spike or the outliers reflect some particular feature of the wines. Citric acid also has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
I log-transformed the long-tailed data to better understand the distribution of residual sugar. The transformed distribution appears bimodal, with residual sugar peaking around 1.5 and again around 8.0. I wonder how each peak relates to quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The smallest chlorides value is 0.009 and the largest is 0.346. The plot above shows the main body of the chlorides distribution, which is approximately normal. I suspect this variable does not affect wine quality much, because wine quality is approximately normally distributed too. Chlorides also has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The smallest free.sulfur.dioxide value is 2.0 and the largest is 289.0. The plot above shows the main body of the free.sulfur.dioxide distribution, which is approximately normal. I suspect this variable does not affect wine quality much, because wine quality is approximately normally distributed too. free.sulfur.dioxide also has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The smallest total.sulfur.dioxide value is 9.0 and the largest is 440.0. The plot above shows the main body of the total.sulfur.dioxide distribution, which is approximately normal. I suspect this variable does not affect wine quality much, because wine quality is approximately normally distributed too. total.sulfur.dioxide also has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Most wines have a density between 0.991 g/cm^3 and 0.997 g/cm^3, with a median of 0.9937 g/cm^3 and a mean of 0.9940 g/cm^3. Density has a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Most wines have a pH between 2.85 and 3.6, with a median of 3.18 and a mean of 3.188. The distribution of this variable is approximately normal. I suspect this variable does not affect wine quality much, because wine quality is approximately normally distributed too. pH also has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
I log10-transformed the long-tailed data to better understand the distribution of sulphates. The transformed sulphates distribution appears normal, but it has one spike near 0.5. What is this peak? The log10-transformed sulphates also have some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Most wines have an alcohol content between 8.5% and 13%, with a median of 10.4% and a mean of 10.51%. The distribution is slightly right-skewed. I wonder what quality is typical for wines with less than 10% alcohol.
##
## (0,4] (4,7] (7,10]
## 183 4535 180
None of these variables shows a large difference in distribution across the quality categories. I wonder whether residual.sugar works best in combination with other variables in its low range or in its high range.
None of these variables shows a large difference in distribution across the quality categories either.
The distribution of alcohol does differ by quality. Wines with a higher alcohol content tend to be of better quality. I wonder whether a high alcohol content actually makes quality better.
There are 4,898 wines in the dataset with 13 variables: a row index (X) and 12 attributes (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). All variables are numeric.
Other observation:
The main feature of interest in this dataset is wine quality. We need to understand how the other variables affect it. I’d like to determine which features are best for predicting the quality of white wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model for wine quality.
Alcohol likely contributes to wine quality. I think all the other variables can support my investigation into quality as well, because the taste of a wine is determined by the combination of its ingredients.
I created a categorical variable for quality. Using this variable, I checked the distribution of each variable to see how it relates to quality.
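The bucketing can be sketched with base R's `cut()`. This is a minimal illustration, not necessarily the exact code used for this report: the break points follow the intervals (0,4], (4,7], (7,10] reported in the counts table, the labels follow the bad/normal/good levels used in the plots, and the sample vector and the name `quality_level` are hypothetical.

```r
# Hypothetical sample of quality scores (not the real dataset)
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# cut() uses right-closed intervals by default: (0,4], (4,7], (7,10]
quality_level <- cut(quality,
                     breaks = c(0, 4, 7, 10),
                     labels = c("bad", "normal", "good"))

# Count how many wines fall into each category
table(quality_level)
```

Because the intervals are right-closed, a score of 4 counts as bad and a score of 7 counts as normal.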
I log-transformed the right-skewed residual.sugar and volatile.acidity. The transformed distribution for residual.sugar appears bimodal, peaking around 1.5 and again around 8.0.
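As a rough illustration of what the transformation does, the sketch below applies `log10()` to the first ten residual.sugar values shown in the `str()` output. It assumes a base-10 log, as in the model formulas later in this report; the variable names are hypothetical.

```r
# First ten residual.sugar values from the str() output in this report
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the long right tail: a tenfold increase becomes +1
log_sugar <- log10(residual_sugar)

# Compare the spread before and after the transformation
range(residual_sugar)
round(range(log_sugar), 2)

# Positions of the two peaks (about 1.5 and 8.0) on the log10 axis
peak_positions <- round(log10(c(1.5, 8.0)), 2)
peak_positions
```

On the log10 axis the two peaks sit well apart, which is why the bimodal shape is easier to see after the transformation.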
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
Most of the variables correlate only loosely with quality. The three strongest correlations are alcohol (0.44), density (-0.31), and chlorides (-0.21).
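The ranking can be reproduced from the quality column of the correlation matrix. The sketch below copies those values into a named vector and sorts them by absolute value; the names `cor_quality` and `ranked` are hypothetical, and with the full data frame the same vector would presumably come from a call to `cor()`.

```r
# Correlations with quality, copied from the quality column of the matrix
cor_quality <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715
)

# Sort by absolute value so strong negative correlations rank with positive ones
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]

# Three strongest correlations with quality
round(head(ranked, 3), 2)
```

Sorting on the absolute value matters here, because two of the three strongest relationships are negative.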
Adding jitter and transparency and changing the plot limits lets us see the positive correlation between total.sulfur.dioxide and free.sulfur.dioxide. This relationship arises because total.sulfur.dioxide includes free.sulfur.dioxide.
Adding jitter and transparency and changing the plot limits lets us see the positive correlation between alcohol and quality.
Adding jitter and transparency and changing the plot limits lets us see the negative correlation between density and quality.
Adding jitter and transparency and changing the plot limits lets us see the negative correlation between chlorides and quality.
Alcohol correlates closely with density (r = -0.78). This is to be expected, because a material property such as density is determined by composition. I think the remaining variance comes from the measurement method and from differences in the test environment. Therefore, I think we should use only one of alcohol or density to predict wine quality.
I plotted the density distribution of volatile.acidity for each quality level (bad, normal, good). I transformed the long-tailed data to better understand the distribution of volatile.acidity. From this plot, I can see that there are small differences in the distribution pattern between quality levels.
I plotted the density distribution of residual.sugar for each quality level (bad, normal, good). I transformed the long-tailed data to better understand the distribution of residual.sugar. From this plot, I can see small differences in the distribution pattern between quality levels. The distributions have two regions, low and high residual.sugar. In the low residual.sugar region, better-quality wines have higher residual.sugar. In the high residual.sugar region, better-quality wines have lower residual.sugar.
I plotted the density distribution of alcohol for each quality level (bad, normal, good). From this plot, I can see that many of the better-quality wines have a high alcohol content. However, I cannot see a difference between normal and bad wines. Which variable determines that difference? I will check the density distributions of other variables.
I plotted the density distribution of free.sulfur.dioxide for each quality level (bad, normal, good). I transformed the long-tailed data to better understand the distribution of free.sulfur.dioxide. With free.sulfur.dioxide, I can see a difference in the distribution pattern between bad wines and the higher quality levels.
Alcohol is the key predictor of wine quality, because it correlates more closely with quality than any other variable. Some variables correlate with alcohol, so I set them aside as predictors of wine quality. From these results, I think I need volatile.acidity. Density looks like a better correlated factor, but I did not count it among the key predictors of wine quality, because it is a material property that correlates with several other variables that are ingredients of wine.
There is a positive correlation between total.sulfur.dioxide and free.sulfur.dioxide. Because total.sulfur.dioxide includes free.sulfur.dioxide, this relationship is to be expected. Many good-quality wines lie in the region of high alcohol content. However, alcohol alone could not distinguish the distribution patterns of normal and bad wines. I found that free.sulfur.dioxide can. I also looked at the density distribution of log-transformed residual.sugar for each quality level (bad, normal, good). At each quality level the distribution has the same bimodal pattern, but the distributions differ near each of the two maxima. In the low residual.sugar region, better-quality wines have higher residual.sugar. In the high residual.sugar region, the opposite holds.
The strongest relationship is between residual.sugar and density, with a correlation of 0.839.
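A quick way to check this, and to flag pairs that are strongly related to each other, is to list the larger off-diagonal entries of the correlation matrix and pick the largest in absolute value. The sketch below hard-codes five entries copied from the matrix in this report; the 0.7 cutoff and the names `cor_pairs`, `strong`, and `strongest` are illustrative assumptions, not part of the original analysis.

```r
# Some of the larger off-diagonal correlations, copied from the matrix
cor_pairs <- data.frame(
  var1 = c("residual.sugar", "alcohol", "free.sulfur.dioxide",
           "total.sulfur.dioxide", "alcohol"),
  var2 = c("density", "density", "total.sulfur.dioxide",
           "density", "residual.sugar"),
  r    = c(0.83896645, -0.78013762, 0.615500965, 0.52988132, -0.45063122)
)

# Pairs above an illustrative cutoff of |r| > 0.7
strong <- cor_pairs[abs(cor_pairs$r) > 0.7, ]

# The single strongest pair by absolute correlation
strongest <- cor_pairs[which.max(abs(cor_pairs$r)), ]
strongest
```

Both pairs above the cutoff involve density, which is consistent with treating density as a property that depends on the other ingredients.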
I’d like to find out which variables determine better wine. In the Bivariate Plots section, I learned that alcohol is a good predictor.
I created a scatter plot of alcohol against log-transformed free.sulfur.dioxide, split by quality level. From this graph, I can see the effect of free.sulfur.dioxide at each quality level. Higher free.sulfur.dioxide goes with higher quality, but the effect is small at low alcohol.
I created a box plot of alcohol against log-transformed free.sulfur.dioxide, split by quality level. From this graph, I can see the variance of free.sulfur.dioxide at each quality level. Within each alcohol bucket, higher quality tends to go with lower variance of log-transformed free.sulfur.dioxide.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide),
## data = wine)
## m3: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide) +
## log10(volatile.acidity), data = wine)
## m4: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide) +
## log10(volatile.acidity) + log10(residual.sugar), data = wine)
## m5: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide) +
## log10(volatile.acidity) + log10(residual.sugar) + chlorides +
## total.sulfur.dioxide + density + pH + sulphates, data = wine)
## m_base: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = wine)
##
## ==================================================================================================================
## m1 m2 m3 m4 m5 m_base
## ------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** 1.080*** 0.360** -0.030 44.163*** 150.193***
## (0.098) (0.134) (0.136) (0.142) (9.594) (18.804)
## alcohol 0.313*** 0.347*** 0.354*** 0.385*** 0.292*** 0.193***
## (0.009) (0.009) (0.009) (0.010) (0.017) (0.024)
## log10(free.sulfur.dioxide) 0.771*** 0.704*** 0.584*** 0.744***
## (0.048) (0.047) (0.048) (0.059)
## log10(volatile.acidity) -1.280*** -1.404*** -1.257***
## (0.074) (0.075) (0.077)
## log10(residual.sugar) 0.272*** 0.499***
## (0.031) (0.050)
## chlorides -0.894 -0.247
## (0.527) (0.547)
## total.sulfur.dioxide -0.002*** -0.000
## (0.000) (0.000)
## density -44.587*** -150.284***
## (9.582) (19.075)
## pH 0.284*** 0.686***
## (0.074) (0.105)
## sulphates 0.506*** 0.631***
## (0.097) (0.100)
## fixed.acidity 0.066**
## (0.021)
## volatile.acidity -1.863***
## (0.114)
## citric.acid 0.022
## (0.096)
## residual.sugar 0.081***
## (0.008)
## free.sulfur.dioxide 0.004***
## (0.001)
## ------------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.230 0.275 0.286 0.301 0.282
## adj. R-squared 0.190 0.230 0.275 0.286 0.299 0.280
## sigma 0.797 0.777 0.754 0.748 0.741 0.751
## F 1146.395 732.923 618.802 491.077 233.466 174.344
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5713.108 -5567.036 -5528.056 -5478.898 -5543.740
## Deviance 3112.257 2955.840 2784.692 2740.720 2686.255 2758.329
## AIC 11684.782 11434.216 11144.072 11068.111 10979.797 11113.480
## BIC 11704.272 11460.202 11176.555 11107.091 11051.259 11197.936
## N 4898 4898 4898 4898 4898 4898
## ==================================================================================================================
The linear model with log transformations of free.sulfur.dioxide, volatile.acidity, and residual.sugar (m5) accounts for 30.1% of the variance in wine quality, compared with 28.2% for the model without the transformations (m_base). I left out two variables, fixed.acidity and citric.acid, because they show little correlation with wine quality.
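The AIC and BIC rows of the model table can be cross-checked from the reported log-likelihoods, which gives a second view of the comparison between m5 and m_base. This is a sketch under the assumption that the table follows R's usual convention for `lm` fits, where the parameter count is the number of coefficients plus one for the residual variance. The coefficient counts are read off the table, and the data frame `models` is hypothetical.

```r
n <- 4898  # number of observations (row N of the table)

# Log-likelihoods and coefficient counts read from the model table
models <- data.frame(
  model  = c("m1", "m5", "m_base"),
  loglik = c(-5839.391, -5478.898, -5543.740),
  n_coef = c(2, 10, 12)  # intercept plus predictors
)

# One extra parameter for the residual variance
k <- models$n_coef + 1

# AIC = -2 logLik + 2k ; BIC = -2 logLik + k log(n)
models$aic <- -2 * models$loglik + 2 * k
models$bic <- -2 * models$loglik + k * log(n)

models
```

Lower values indicate a better trade-off between fit and complexity. m5 has the lowest AIC and BIC of the three, which agrees with its higher R-squared.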
In the scatter plot of alcohol against log-transformed free.sulfur.dioxide, I could see that quality increases with higher free.sulfur.dioxide, but the effect is small at low alcohol.
In the box plot of alcohol against log-transformed free.sulfur.dioxide, the variance of log-transformed free.sulfur.dioxide tends to decrease as wine quality increases.
Yes, I created a linear model of wine quality using alcohol, three log-transformed variables (free.sulfur.dioxide, volatile.acidity, and residual.sugar), and several untransformed variables (chlorides, total.sulfur.dioxide, density, pH, and sulphates).
The distribution of wine quality appears to be normal. This is a natural distribution: nobody wants to make bad wine, but it is hard to make good wine.
Wine quality is mainly determined by alcohol and free.sulfur.dioxide. Better wines tend to have a higher alcohol content, and bad wines tend to have low free.sulfur.dioxide. Free sulfur dioxide is added to protect wine from oxidation. Therefore, this perhaps indicates that many wines need protection against oxidation to achieve better quality.
The plot indicates that, in the linear model, wine quality is mainly predicted by alcohol and log-transformed free.sulfur.dioxide. Between low and high values of transformed free.sulfur.dioxide, quality changes by about one point.
The white wine data set contains information on almost 4,900 white wines across twelve variables. I started by understanding the individual variables in the data set, and then I explored questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wine across many variables and created a linear model to predict wine quality.
Of all the variables in the dataset, alcohol showed the clearest trend with quality: wines with higher alcohol tended to be of better quality. However, alcohol alone, with a correlation of 0.436, could not explain wine quality. Predicting wine quality required some log-transformed variables and some untransformed variables as well. I therefore checked the density distribution of alcohol at each quality level and confirmed that its pattern could only pick out the better-quality wines. Based on that result, I looked for a variable whose distribution could pick out the bad-quality wines, and found log-transformed free.sulfur.dioxide. I also log-transformed volatile.acidity and residual.sugar for better understanding. To create the linear model, I used alcohol, the three log-transformed variables, and five untransformed variables (chlorides, total.sulfur.dioxide, density, pH, sulphates) that have some correlation with wine quality. The model was able to account for 30.1% of the variance in wine quality. I feel this R-squared value is too small for predicting wine quality. Perhaps it is hard to predict wine quality, which may have non-linear relationships with some variables, using a linear model with limited descriptive capacity.
Some limitations of this model stem from the source of the data. Given that the white wine data date from 2009 or earlier, the model may undervalue white wines on the market today, owing to changes in evaluation methods and evaluators. To investigate this data further, I would be interested in testing a non-linear model to predict wine quality, and I’d like to increase the prediction accuracy of the model. I’m also interested in how the relationships between wine quality and the other variables have changed in more recent datasets.